Goal: Load Twitter API into a data management system and write queries to retrieve data.

  • Dataset of Elon Musk’s most recent Tweets during 2015-2022, stored in csv format, where each tweet is in its own separate row object.
user_id screen_name followers_count statuses_count friends_count account_created_at verified
x44196397 elonmusk 95589999 18134 114 2009-06-02 TRUE

Introduction

Apache Spark is an open-source engine for large-scale parallel data processing known for its speed, ease of use, and cutting-edge analytics. It provides high-level APIs in general-purpose programming languages such as Scala, Python, and R, as well as an optimization engine supporting standard data analysis methods.

Azure Databricks is an analytics platform based on Microsoft Azure cloud services, enabling the latest versions of Apache Spark and open source libraries. Built with Spark capabilities, Databricks provides a cloud-based interactive workspace with fully managed Spark clusters. This allows users to work in a single, easy-to-use environment, create and configure clusters in seconds, and quickly execute Spark code.

Data Description

Twitter (Elon Musk 2015-2022): Dataset of Elon Musk’s most recent Tweets during 2015-2022, stored in RDS format, where each tweet is in its own separate row object. All Tweets are collected, parsed, and plotted using rtweet in R. In total, there are more than thousands of tweets in this dataset, including retweets and replies. All objects are to go into a single database.

Data Set Variables
1 status_id 14 hashtags 27 quoted_followers_count 40 retweet_location
2 created_at 15 symbols 28 quoted_location 41 retweet_description
3 user_id 16 media_expanded_url 29 quoted_description 42 retweet_verified
4 screen_name 17 media_type 30 quoted_verified 43 name
5 text 18 mentions_screen_name 31 retweet_status_id 44 location
6 source 19 quoted_status_id 32 retweet_text 45 description
7 reply_to_screen_name 20 quoted_text 33 retweet_created_at 46 followers_count
8 is_quote 21 quoted_created_at 34 retweet_source 47 friends_count
9 is_retweet 22 quoted_source 35 retweet_favorite_count 48 statuses_count
10 favorite_count 23 quoted_favorite_count 36 retweet_retweet_count 49 account_created_at
11 retweet_count 24 quoted_retweet_count 37 retweet_user_id 50 verified
12 quote_count 25 quoted_user_id 38 retweet_screen_name
13 reply_count 26 quoted_screen_name 39 retweet_followers_count

Twitter API

Once you have your twitter app setup and loaded the rtweet package in R, you can run the following command to generate a token for the Twitter data. The first thing that you need to setup in your code is your authentication and then create a token that authenticates access to tweets.

library(rtweet) # load rtweet package
twitter_token <- create_token(
  app = "mytwitterapp",
  consumer_key = "api_key", consumer_secret = "api_secret",
  access_token = "access_token", access_secret = "access_secret")
use_oauth_token(twitter_token) # authenticate via web browser

Now we search Twitter’s full archive API. We run the search_fullarchive command in order to be able to access historical tweets from a user. The example below captures all of Elon Musk’s tweets from January 01, 2010 to May 28, 2022.

df <- search_fullarchive(q = "from:elonmusk", n = 10000, env_name = enviroment_name, fromDate = "201001010000", toDate = "202205280000")
Data Set Preview:
created_at screen_name text favorite_count retweet_count quote_count reply_count is_quote is_retweet
2022-06-20 23:52:18 elonmusk (Degentraland?) Artificial Insemination? 6963 270 66 830 FALSE FALSE
2022-06-20 23:43:38 elonmusk AI gets better every day https://t.co/Lz5XfXRJjh 90161 9333 1075 8200 FALSE FALSE
2022-06-20 21:34:57 elonmusk Some great suggestions in the comments! 22970 1078 101 4214 FALSE FALSE

Questions of Interest

  1. (path finding) Display the thread (replies) of tweets (the tweet, time, id, in reply to id, user name with their screen name) posted by Elon Musk with screen_name in the order in which they were posted.
  2. (location) From which location have the tweets been most actively posted (most number of tweets)?
  3. (hashtags) Which hashtags does Musk use the most, and how many tweets are associated with these hashtags?
  4. (topics) What word does Musk mention the most in his tweets? What company products does Musk mention the most in his tweets? Products include Falcon 9, Starlink Satellites, Model 3 cars, etc.
  5. (trending) Are there any trends of what Musk tweets about the company?
  6. (nature of engagement) What is the percentage of different types of tweets (simple tweet, reply, retweet, quoted tweet) to their overall number of tweets?

Storing Data in Databricks

Building a Databricks workspace using an Apache Spark cluster.

First, we create a Databricks workspace from the Azure portal and then launch the workspace, which redirects us to the interactive Databricks portal. We create a Spark cluster from the Databricks interactive workspace and configure a notebook on the cluster. In the notebook, we can use either PySpark or SparkR to read data from a dataset into a Spark DataFrame. Using the Spark DataFrame, we can run a Spark SQL job to query the data.

R Code:

require(SparkR)
df <- read.df("dbfs:/FileStore/dfclean.csv", source = "csv", 
              header = "true", inferSchema = "true")
createOrReplaceTempView(df, "twitterTemp") # register table for SQL

The Azure Databricks system stores this data file in FileStore, located in the FileStore/ folder, which we then use the above SparkR command to read in the data as a dataframe.


Querying Twitter Data

Users can perform relational procedures on DataFrames using a domain-specific language (DSL) similar to R dataframes and Python Pandas. DataFrames support standard relational operators, including projection (select), filter (where), join, and aggregations (group by).

Query 1. Mentions

  • Track Elon Musk’s engagement with other twitter users and display the corresponding thread of tweets posted by Musk (tweet content, date posted, users in tweet).

To track Elon Musk’s engagement with people on twitter, we want to look into tweets containing conversations with and directed to other users. We begin by unpacking information for each of Elon Musk’s tweets that mention another person’s username. Specifically, mentions are a particular type of tweet containing other account usernames, preceded by the “@” symbol.

R Code:

dfMentions <- SparkR::select(df, "created_at", "mentions_user_id",
                     "mentions_screen_name", "text")
dfMentions <- SparkR::filter(dfMentions, dfMentions$mentions_user_id != NA)
createOrReplaceTempView(dfMentions, "dfMentions")
Mentions
created_at mentions_user_id mentions_screen_name text
2022-06-17 16:51:17 x1016059981907386368 x15854702 x10850192 x912969880852299776 teslaownersSV UAW GM klwtts (teslaownersSV?) (UAW?) (GM?) (klwtts?) Yup
2022-06-17 16:17:14 x34743251 SpaceX Liftoff! https://t.co/28eNKniMqe
2022-06-18 02:56:18 x87101095 x13298072 x34743251 x14171401 x1263491240336769026 x998140573008695296 x1957568797 x985686123123949568 x1285872552766734346 blueskykites Tesla SpaceX mayemusk WholeMarsBlog 28delayslater JohnnaCrider1 Kristennetten SirineAti (blueskykites?) (Tesla?) (SpaceX?) (mayemusk?) (WholeMarsBlog?) (28delayslater?) (JohnnaCrider1?) (Kristennetten?) (SirineAti?) (GailAlfarATX?) (DimaZeniuk?) (bevedoni?) (RationalEtienne?) (ashleevance?) (adamhoov?) (klwtts?) (RenataKonkoly?) For a couple of months, but, yeah, that looks like the place. Does Mark still live there?
2022-06-20 17:54:19 x1016059981907386368 x34743251 teslaownersSV SpaceX (teslaownersSV?) (SpaceX?) Super talented team at SpaceX

As shown above, there exist tweets containing multiple mentioned usernames within the body of the text, all grouped together in a single row. So now we must manipulate the data so that each mentioned user for a tweet forms its own row, which allows us to count the total number of times Musk mentioned a unique user.

dfMentions %>%
  tibble(user = str_extract_all(text, "@\\w+")) %>%
  tidyr::unnest_longer(user) %>%
  dplyr::count(user, sort = TRUE)

The above command uses the str_extract_all() function to extract the mentioned users for each tweet and unnest_longer() to transform the nested lists into tidy rows so that each row contains only one user. Lastly, we count the total number of observations for each unique user.

SQL Query:

SELECT mentions_screen_name, COUNT(*) AS n
FROM mentionDF
WHERE mentions_screen_name != 'NA'
GROUP BY mentions_screen_name
SORT BY n DESC;

Hence, SpaceX and Tesla are the most frequently mentioned users in Musk’s tweets. Linking conversations together, a reply is a type of tweet sent in direct response to another user’s tweet. Similar to mentions, replies allow users to direct tweets toward other twitter users and interact in conversations. Following the same general procedure above, we obtain the following results.


Query 2. Types

  • What is the percentage of different types of tweets (simple tweet, reply, retweet, quoted tweet) to their overall number of tweets?

The different type of tweets that exist are general tweets, mentions, replies, retweets, and quotes. The mention and reply tweet types are already defined and analyzed in the previous section. General tweets are original twitter posts containing text, photos, a GIF, and/or video, but do not include any mentions, replies, retweets, or quotes. Lastly, retweets and quotes are both re-postings of another person’s tweet, although quotes allow users to post another person’s tweet with their own added comment.

R Code:

nGeneral <- nrow(df[df$is_retweet == FALSE, ] %>% # removes retweets
                   subset(is.na(reply_to_status_id))) # removes replies
nMentions <- nrow(subset(df, !is.na(df$mentions_user_id)))
nReplies <- nrow(subset(df, !is.na(df$reply_to_status_id)))
nRetweets <- nrow(nRetweets <- df[df$is_retweet == TRUE, ])
nQuotes <- nrow(df[df$is_quote == TRUE, ])

tweetTypes <- data.frame(
  type = c("General", "Mentions", "Replies", "Retweets", "Quotes"),
  count = c(nGeneral, nMentions, nReplies, nRetweets, nQuotes))

In the above, we create five different data sets either containing only general tweets, mentions, replies, retweets, or quotes. We then count the number of observations for each data set and store it in a separate data frame containing the tweet type and its respective count.

Now, let’s display a table holding information for each of Elon Musk’s retweets and query the data to obtain the usernames of Musk’s most frequently retweeted users. To identify the most frequently retweeted users, we use tidyr tools to unnest, count, and sort each user from Musk’s retweets.

dfRetweet <- df %>% dplyr::filter(is_retweet == TRUE) %>%
  dplyr::select(
    retweet_created_at, retweet_text, retweet_user_id,
    retweet_screen_name, retweet_favorite_count,
    retweet_retweet_count, retweet_location) %>%
  dplyr::arrange(desc(retweet_created_at))

dfRetweet %>% group_by(retweet_screen_name) %>%
  tally(sort = TRUE)
retweets
created_at text user_id screen_name favorite_count retweet_count location
2022-04-27 23:41:36 Docking confirmed! https://t.co/RyPZBAv5Lo x34743251 SpaceX 128109 6890 Hawthorne, CA
2020-05-14 23:22:56 Breakthrough, Part Deux (LVCVA?) https://t.co/JQ7sjPXSfZ x859816394556284929 boringcompany 7278 649 Austin/Las Vegas
2016-12-05 06:33:41 We’re releasing Universe, a platform for measuring and training AI agents: https://t.co/bx7OjMDaJK x4398626122 OpenAI 3088 1759 NA
2016-03-31 20:10:58 Longest line EVER #Model3 https://t.co/EafMhK3BDC x13298072 Tesla 5487 3059 NA

Query 4. Hashtags

  • Which hashtags does Musk use the most, and how many tweets are associated with these hashtags?

Here we want to extract hashtags from the content of the Tweets data. The following command unpacks the hashtags column into an array of strings, followed by counting how many unique hashtags used by Elon Musk.

SQL Query:

SELECT hashtags, COUNT(*) AS hashtagcount
FROM twitterTemp 
WHERE hashtags != 'NA'
GROUP BY hashtags
SORT BY hashtagcount DESC;


Query 5. Topic Words

  • What word does Musk mention the most in his tweets? What company products does Musk mention the most in his tweets?

Figuring out what words are most common in Elon Musk’s tweets involves text mining tasks. Using tidy data principles, the first step is to clean up the text from our dataset by using lowercase and removing punctuation, usernames, links, etc. We then use various R tidy tools to convert the text to tidy formats and remove stop words.

R Code:

library(tidytext)
library(stringr)

dfWords <- df %>%
  dplyr::mutate(text = str_remove_all(text, "&amp;|&lt;|&gt;"),
         text = str_remove_all(text, "\\s?(f|ht)(tp)(s?)(://)([^\\.]*)[\\.|/](\\S*)"),
         text = str_remove_all(text, "[^\x01-\x7F]")) %>% 
  unnest_tokens(word, text, token = "tweets") %>%
  dplyr::filter(!word %in% stop_words$word,
        !word %in% str_remove_all(stop_words$word, "'"),
        str_detect(word, "[a-z]"),
        !str_detect(word, "^#"),         
        !str_detect(word, "@\\S+")) %>%
  dplyr::count(word, sort = TRUE)

In the above command, the pattern matching function str_remove_all() removes unwanted text and the unnest_tokens() function splits the text of each tweet into tokens, using a one-word-per-row format. We then use the str_detect() function to filter out words by removing stop words, unicode characters, and whitespace.


Sentiment Analysis

Here we use the syuzhet R package to iterate over a vector of strings consisting of the text from all of Elon Musk’s tweets in our dataset. To obtain the vector of tweet text, the plain_tweets() function from the rtweet package is used to clean up the tweets character vector to cleaned up, plain text. We then pass this vector to the get_sentiment() function, which consequently returns the sentiment values based on the custom sentiment dictionary developed from a collection of human coded sentences.

R Code:

round_time <- function(x, secs)
  as.POSIXct(hms::round_hms(x, secs))
sent_scores <- function(x)
  syuzhet::get_sentiment(plain_tweets(x)) - .5

df.sentiment <- gfg_data %>%
  dplyr::mutate(days = round_time(created_at, 60 * 60 * 24),
                sentiment = sent_scores(text)) %>%
  dplyr::group_by(days) %>%
  dplyr::summarise(sentiment = sum(sentiment, na.rm = TRUE))

Extending the above sentiment analysis, the next step is to understand the opinion or emotion in the text. First, we must clean the text from our dataset so that it’s in a tidy format. We accomplish this using the R function gsub() to replace unwanted text and the get_nrc_sentiment() function to get the emotions and valences from the NRC sentiment dictionary for each word from all of Musk’s tweet.

R Code:

txt <- c("rt|RT", "http\\w+", "<.*?>", "@\\w+", "[[:punct:]]", "\r?\n|\r", "[[:digit:]]", "[ |\t]{2,}", "^ ", " $")

cleanTweet <- as.vector(df$text)
cleanTweet <- grep::gsub(txt, "", cleanTweet)

textSentiment <- syuzhet::get_nrc_sentiment(cleanTweet)
nrc_sentiment <- cbind(df, textSentiment) %>% 
  dplyr::select(created_at, anger, anticipation, disgust, fear, 
                joy, sadness, surprise, trust, negative, positive)

In the above command, the gsub function replaces all occurrences of the given patterns and the get_nrc_sentiment function calculates the presence of eight different emotions and their corresponding valence. The resulting columns include the eight emotions disgust, fear, joy, sadness, surprise, trust and their respective positive or negative valence.


References

Brown, Leif, Jason Howell, and Mary McCready. 2022. “Quickstart - Run a Spark Job on Azure Databricks Workspace Using Azure Portal.” Microsoft Technical Documentation. https://docs.microsoft.com/en-us/azure/databricks/scenarios/quickstart-create-databricks-workspace-portal.
Kearney, Michael W. 2019. “Rtweet-Workshop.” Data Science and Analytics Presentation. University of Missouri School of Journalism: Informatics Institute. https://rtweet-workshop.mikewk.com/.
Kearney, Michael W., Francois Briatte, Andrew Heiss. 2019. “Rtweet: Collecting and Analyzing Twitter Data.” Journal of Open Source Software 4 (42): 1829. https://doi.org/10.21105/joss.01829.